Time Series Lab

Author

Zoe Zhou

Code
library(tidyverse)
library(here)
# library(lubridate)
library(tsibble) # new lib!
library(feasts)
library(fable)

Part 1: Time series with Toolik Lake data

Always look at your data

Toolik Station (LTER) meteorological data (Source: Source: Shaver, G. 2019. A multi-year DAILY file for the Toolik Field Station at Toolik Lake, AK starting 1988 to present. ver 4. Environmental Data Initiative.) See the Toolik Field Station Environmental Data Center site for more details.

Read in data:

Code
toolik_df <- read_csv(here("data", "toolik_daily.csv"))

Go ahead and try plotting the data as imported.

Code
ggplot(data = toolik_df, aes(x = date, y = daily_air_temp)) +
  geom_line()

### Booo we get a warning (only one observation per series)

Notice that it doesn’t work - because R doesn’t understand the date is a date until we tell it.

Convert the data frame to a tsibble

Let’s go ahead and convert it to a tsibble using the as_tsibble() function. First, we’ll need to convert the date to a date class, then convert to a tsibble. We could just keep it as a regular dataframe with a date column and do a lot with that, but the tsibble package gives lots of functionality around time series (thus the ts at the start).

Code
toolik_ts <- toolik_df %>% 
  mutate(date = lubridate::mdy(date)) %>% 
  as_tsibble(key = NULL, ### if we had multiple obs on same date from diff locations
             index = date) ### time index, here our column is `date`

Now let’s plot it:

Code
ggplot(data = toolik_ts, aes(x = date, y = daily_air_temp)) +
  geom_line() +
  labs(x = "Date",
       y = "Mean daily air temperature (Celsius)\n at Toolik Station")

We need to ask some big picture questions at this point, like:

  • Does there appear to be an overall trend?
  • Does there appear to be seasonality?
  • Does there appear to be cyclicality?
  • Any notable outliers or additional patterns?

Use filter_index() to filter by date-times!

We can use filter_index() specifically to help us filter data by time spans. See ?filter_index() for more information.

Formulas that specify start and end periods (inclusive), or strings.

~ end or . ~ end: from the very beginning to a specified ending period.

start ~ end: from specified beginning to ending periods.

start ~ .: from a specified beginning to the very end of the data. Supported index type: POSIXct (to seconds), Date, yearweek, yearmonth/yearmon, yearquarter/yearqtr, hms/difftime & numeric.

So for example “2010-12” ~ “2011-01” would filter from December 2010 through January 2011.

Code
toolik_ts %>% 
  filter_index("2010-12","2011-01") %>% 
  tail()
# A tsibble: 6 x 5 [1D]
  date       daily_air_temp daily_precip mean_barom mean_windspeed
  <date>              <dbl>        <dbl>      <dbl>          <dbl>
1 2011-01-26         -22.5            NA       911.           1.19
2 2011-01-27         -18.5            NA       926.           2.17
3 2011-01-28         -12.0            NA       933.           3.36
4 2011-01-29          -4.85           NA       928.           7.33
5 2011-01-30          -6.68           NA       925.           4.90
6 2011-01-31          -6.71           NA       918.           5.93

Try to filter the Toolik tsibble from April 10, 2006 to May 15, 2006.

Code
toolik_ts %>% 
  filter_index("2006-04","2006-05-15")
# A tsibble: 31 x 5 [1D]
   date       daily_air_temp daily_precip mean_barom mean_windspeed
   <date>              <dbl>        <dbl>      <dbl>          <dbl>
 1 2006-04-01          -7.33           NA         NA             NA
 2 2006-04-02          -9.52           NA         NA             NA
 3 2006-04-03         -19.0            NA         NA             NA
 4 2006-04-04         -26.0            NA         NA             NA
 5 2006-04-05         -27.6            NA         NA             NA
 6 2006-04-06         -27.5            NA         NA             NA
 7 2006-04-07         -21.6            NA         NA             NA
 8 2006-04-08         -15.4            NA         NA             NA
 9 2006-04-09         -15.8            NA         NA             NA
10 2006-04-10         -18.6            NA         NA             NA
# ℹ 21 more rows

Filter another index from December 20, 2020 to the end of the dataset.

Code
toolik_ts %>% 
  filter_index("2020-12-20"~.)
# A tsibble: 742 x 5 [1D]
   date       daily_air_temp daily_precip mean_barom mean_windspeed
   <date>              <dbl>        <dbl>      <dbl>          <dbl>
 1 2020-12-20         -39.0           0         909.          0.652
 2 2020-12-21         -26.2           0.2       912.          2.25 
 3 2020-12-22          -7.97          1.5       908.          6.33 
 4 2020-12-23          -6.79          2.2       901.          2.49 
 5 2020-12-24          -7.44          0.7       913.          4.51 
 6 2020-12-25          -8.26          0         918.          3.01 
 7 2020-12-26         -21.4           0.1       917.          1.59 
 8 2020-12-27         -20.4           0         916.          1.11 
 9 2020-12-28         -15.0           0.3       916.          1.20 
10 2020-12-29         -16.0           0.1       920.          1.22 
# ℹ 732 more rows

filter_index() also accepts multiple periods using commas to separate the intervals. Try to filter from December 2010 through January 2011, and from September 2, 2013 to October 31, 2014.

Don’t worry about saving the output to another object.

Code
toolik_ts %>% 
  filter_index("2010-12"~"2011-01", "2013-09-02"~"2014-10-31")
# A tsibble: 487 x 5 [1D]
   date       daily_air_temp daily_precip mean_barom mean_windspeed
   <date>              <dbl>        <dbl>      <dbl>          <dbl>
 1 2010-12-01          -21.8          0.5       928.           4.23
 2 2010-12-02          -23.4          0.3       927.           2.36
 3 2010-12-03          -17.9          0.2       921.           2.63
 4 2010-12-04          -24.0          1.7       919.           3.70
 5 2010-12-05          -31.7          0         936.           2.39
 6 2010-12-06          -23.5          0         940.           3.07
 7 2010-12-07          -21.3          0         936.           3.50
 8 2010-12-08          -22.1          0         932.           3.77
 9 2010-12-09          -18.7          2.2       930.           3.35
10 2010-12-10          -26.0          0.4       929.           3.37
# ℹ 477 more rows

Use index_by() to aggregate time series by increments

index_by() replaces group_by() for time series data. After the data is indexed, we can still use summary, but now we’re doing it across dates.

Code
toolik_month <- toolik_ts |>  
  index_by(yr_mo = ~yearmonth(.)) |> 
  summarize(monthly_mean_temp = mean(daily_air_temp, na.rm = TRUE)) |> 
  ungroup() ### just like after group_by()

Now let’s take a look:

Code
toolik_month %>% 
  ggplot(aes(x = year(yr_mo), y = monthly_mean_temp)) +
  geom_line() +
  facet_wrap(~month(yr_mo, label = TRUE)) +
  labs(x = "Year",
       y = "Annual mean air temperature (Celsius)",
       title = "Toolik Station mean annual air temperature",
       subtitle = "1988 - 2018")

Source: Shaver, G. 2019. A multi-year DAILY weather file for the Toolik Field Station at Toolik Lake, AK starting 1988 to present. ver 4. Environmental Data Initiative.

There are some other examples of index_by() in the fable package documentation. Look at the key for more examples.

Part 2: Time series wrangling & forecasting

To reinforce skills for wrangling, visualizing, and forecasting with time series data, we will use data on US residential energy consumption from January 1973 - September 2023 (from the US Energy Information Administration).

  • Dataset: U.S. Residential Energy Consumption (Jan 1973 - Sep 2023)
  • Units: Trillion BTU
  • Source: US Energy Information Administration (https://www.eia.gov/totalenergy/data/monthly/index.php)

Read in energy data and convert to a tsibble

Read in the energy.csv data (use here(), since it’s in the data subfolder).

Code
energy_df <- read_csv(here("data", "energy.csv"))

Analysis goal:

  • Examine patterns and trends in residential energy consumption over time

  • Predict where residential energy use patterns will be five years from now

Wrangle data into a time series format

Explore the energy_df object as it currently exists. Notice that there is a column yyyymm that contains the 4-digit year and 2-digit month. Currently, however, R understands that as a character (instead of as a date). Our next step is to convert it into a time series data frame (a tsibble), in two steps:

  1. Add a new column (date) that is the current month column converted to a time series class, yearmonth
  2. Convert the data frame to a tsibble, with that date column as the time index, and sector as key
Code
energy_ts <- energy_df %>% 
  mutate(date= yearmonth(yrmonth)) %>% 
  as_tsibble(index=date, 
             key=sector) # group by sector if same date

Exploratory time series visualization

Raw data graph

Let’s take a quick look at our tsibble (for residential energy use, in trillion BTU):

Code
ggplot(data = energy_ts, aes(x = date, y = energy_total, color = sector)) +
  geom_line() +
  labs(y = "Energy consumption by sector \n (Trillion BTU)")

Looks like there are some interesting things happening. Focus on residential:

  • Is there an overall trend? For residential, increase but platau
  • Is there seasonality? Yes, stronger in residential than industrial.
  • Any cyclicality evident? not really,
  • Any other notable patterns, outliers, etc.?

Seasonplot:

A seasonplot can help point out seasonal patterns, and help to glean insights over the years. We’ll use feasts::gg_season() to create an exploratory seasonplot, which has month on the x-axis, energy consumption on the y-axis, and each year is its own series (mapped by line color).

Code
energy_ts %>% 
  filter(sector == 'residential') %>%
  gg_season(y = energy_total, pal = hcl.colors(n = 9)) + 
  theme_minimal() +
  labs(x = "month",
       y = "residential energy consumption (trillion BTU)")

This is really useful for us to explore both seasonal patterns, and how those seasonal patterns have changed over the years of this data (1973 - 2023). What are the major takeaways from this seasonplot?

Let’s explore the data a couple more ways:

Subseries plot:

Code
energy_ts %>% gg_subseries(energy_total)

Our takeaway here is similar: there is clear seasonality (higher values in winter months), with an increasingly evident second peak in June/July/August. This reinforces our takeaways from the raw data and seasonplots.

Decomposition (here by STL)

See Rob Hyndman’s section on STL decomposition to learn how it compares to classical decomposition we saw in lecture: “STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using LOESS”, while LOESS is a method for estimating nonlinear relationships.” LOESS (“Locally estimated scatterplot smoothing”) uses a weighted moving average across all points in the dataset, weighted by distance from the point being averaged.

Notice that it allows seasonality to vary over time (a major difference from classical decomposition, and important here since we do see changes in seasonality).

Code
# Find STL decomposition
dcmp <- energy_ts %>%
  filter(sector == 'residential') %>%
  model(feasts::STL(energy_total ~ season(period = '1 year') + trend(window = 25)))


# Visualize the decomposed components
components(dcmp) %>% 
  autoplot() +
  theme_minimal()

Autocorrelation function (ACF)

We use the ACF to explore autocorrelation (here, we would expect seasonality to be clear from the ACF):

Code
energy_ts %>% 
  filter(sector == 'residential') %>%
  ACF(energy_total) %>% 
  autoplot()

And yep, we see that observations separated by 12 months are the most highly correlated, reflecting strong seasonality we see in all of our other exploratory visualizations.

Forecasting by Holt-Winters exponential smoothing

Note: here we use ETS, which technically uses different optimization than Holt-Winters exponential smoothing, but is otherwise the same (From Rob Hyndman: “The model is equivalent to the one you are fitting with HoltWinters(), although the parameter estimation in ETS() uses MLE.”)

To create the model below, we specify the model type (exponential smoothing, ETS), then tell it what type of seasonality it should assume using the season("") expression, where “N” = non-seasonal (try changing it to this to see how unimpressive the forecast becomes!), “A” = additive, “M” = multiplicative. Here, we’ll say seasonality is multiplicative due to the change in variance over time and also within the secondary summer peak, and trend is additive:

Code
# Create the model:
energy_fit <- energy_ts %>%
  # filter_index('2000-01' ~ .) %>% 
    ### try different date windows since trend seems to change 
  filter(sector == 'residential') %>%
  group_by_key(sector) %>%
  model(
    ets = ETS(energy_total ~ season(method = "M") + trend(method = "A"))
  )

# Forecast using the model 5 years into the future:
energy_forecast <- energy_fit %>% 
  forecast(h = "5 years")

# Plot just the forecasted values (with 80 & 95% CIs):
energy_forecast %>% 
  autoplot()

Code
# Or plot it added to the original data:
energy_forecast %>% 
  autoplot(energy_ts)

Assessing residuals

We can use broom::augment() to append our original tsibble with what the model predicts the energy usage would be based on the model. Let’s do a little exploring through visualization.

First, use broom::augment() to get the predicted values & residuals:

Code
# Append the predicted values (and residuals) to original energy data
energy_predicted <- broom::augment(energy_fit)

# Use View(energy_predicted) to see the resulting data frame

Now, plot the actual energy values (energy_total), and the predicted values (stored as .fitted) atop them:

Code
ggplot(data = energy_predicted) +
  geom_line(aes(x = date, y = energy_total)) +
  geom_line(aes(x = date, y = .fitted), color = "red", alpha = .7)

Now let’s explore the residuals. Remember, some important considerations: Residuals should be uncorrelated, centered at 0, and ideally normally distributed. One way we can check the distribution is with a histogram:

Code
ggplot(data = energy_predicted, aes(x = .resid)) +
  geom_histogram()